Predicting the impact of genotype-by-genotype interaction on the purebred–crossbred genetic correlation from phenotype and genotype marker data of parental lines

Background The genetic correlation between purebred (PB) and crossbred (CB) performances (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{pc}$$\end{document}rpc) partially determines the response in CB when selection is on PB performance in the parental lines. An earlier study has derived expressions for an upper and lower bound of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{pc}$$\end{document}rpc, using the variance components of the parental purebred lines, including e.g. the additive genetic variance in the sire line for the trait expressed in one of the dam lines. How to estimate these variance components is not obvious, because animals from one parental line do not have phenotypes for the trait expressed in the other line. Thus, the aim of this study was to propose and compare three methods for approximating the required variance components. The first two methods are based on (co)variances of genomic estimated breeding values (GEBV) in the line of interest, either accounting for shrinkage (VCGEBV-S) or not (VCGEBV). The third method uses restricted maximum likelihood (REML) estimates directly from univariate and bivariate analyses (VCREML) by ignoring that the variance components should refer to the line of interest, rather than to the line in which the trait is expressed. We validated these methods by comparing the resulting predicted bounds of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{pc}$$\end{document}rpc with the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{pc}$$\end{document}rpc estimated from PB and CB data for five traits in a three-way cross in pigs. Results With both VCGEBV and VCREML, the estimated \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{pc}$$\end{document}rpc (plus or minus one standard error) was between the upper and lower bounds in 14 out of 15 cases. However, the range between the bounds was much smaller with VCREML (0.15–0.22) than with VCGEBV (0.44–0.57). With VCGEBV-S, the estimated \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{pc}$$\end{document}rpc was between the upper and lower bounds in only six out of 15 cases, with the bounds ranging from 0.21 to 0.44. Conclusions We conclude that using REML estimates of variance components within and between parental lines to predict the bounds of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{pc}$$\end{document}rpc resulted in better predictions than methods based on GEBV. Thus, we recommend that the studies that estimate \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{pc}$$\end{document}rpc with genotype data also report estimated genetic variance components within and between the parental lines. Supplementary Information The online version contains supplementary material available at 10.1186/s12711-022-00773-z.


Background
The genetic correlation between purebred (PB) and crossbred (CB) performances ( r pc ) is an important parameter in crossbreeding programs, because the r pc partially determines the response in CB performance when selection is based on PB performance in the parental lines [1,2]. Thus; low values of r pc may indicate that CB information should be collected when the aim is to improve CB performance.
In livestock populations where crossbreeding is applied, r pc is typically lower than 1. For example, the average estimated r pc across traits was about 0.63 in pigs [3], and about 0.71 in poultry [4]. Two phenomena can lead to r pc values being lower than 1, namely genotypeby-environment interactions (GxE) and genotype-bygenotype interactions (GxG, i.e. non-additive genetic effects).
r pc is defined as the correlation between the additive genetic values of PB individuals, for PB and CB performances [2]. CB performance can be seen as a trait that is not expressed in the PB breeding individuals themselves, but in their CB offspring. The r pc in a given breeding line (e.g. line 1) when mated to another breeding line is defined as: where σ 1 is the standard deviation of the additive genetic values for PB performance of individuals in line 1, σ 1(C) is the standard deviation of the additive genetic values of individuals in line 1 for the trait expressed in CB, and σ 1,1(C) is the covariance between the additive genetic values in line 1 for PB and CB performances [2,5]. Suppose the causal loci and their true average effects of allele substitution (hereafter called average effects) are known, the vector of additive genetic values of individuals in line 1 for the trait expressed in CB animals ( u 1(C) ) could be obtained as the product of the matrix of genotypes of individuals in line 1, and the vector of average effects expressed in CB animals. σ 1(C) could then be computed as the standard deviation of u 1(C) , and σ 1,1(C) as the covariance between u 1(C) and the vector of additive genetic values of individuals in line 1 for the trait expressed in line 1 (i.e. for PB performance, u 1 ). r pc can be estimated from phenotypes measured on both PB and CB animals when the pedigree relationships between these animals are known. However, in practice collecting CB phenotypes can be costly, and the pedigree of CB animals is often not recorded. The need for pedigree records can be alleviated by genotyping the PB and CB animals that have phenotypes. The genotypes provide a link between PB and CB performances, enabling the estimation of r pc . Although this approach can lead to (1) accurate estimates of r pc [6][7][8], it requires large investments for phenotyping and genotyping CB animals, while it is uncertain that using such data yields more profit. Therefore it may be beneficial for breeders to use predictions of r pc based only on information from the parental lines.
In a previous study, Duenk et al. [9] derived expressions for the approximate bounds of r pc based on true variance components computed from the effects of quantitative trait loci (QTL) and the genotypes in the parental lines. These expressions predict the lower and upper bounds of r pc when only GxG and no GxE interaction is present. For brevity, we will simply use the symbol r pc , but this symbol should be interpreted as the decrease of r pc due to GxG interactions, throughout the manuscript. The expressions for the bounds of r pc allow to predict r pc without the use of data from CB animals. For example, the approximate lower bound of r pc in line 1 can be predicted as: In Eq. (2), σ 1 is the standard deviation of the additive genetic values for PB performance in line 1 that is typically estimated from data used in routine evaluations. However, the other two variance components in Eq. (2) differ from the genetic parameters that are typically estimated. σ 1(2) is the standard deviation of the additive genetic values of individuals in line 1 (i.e., the line for which r pc is predicted, called the focal line), for the trait expressed in line 2 (called the trait line). Suppose the causal loci and their true average effects are known, the vector of additive genetic values of individuals in line 1 for the trait expressed in line 2 ( u 1(2) ) could be obtained as the product of the matrix of genotypes of individuals in line 1, and the vector of average effects expressed in line 2. σ 1(2) could then be computed as the standard deviation of u 1(2) , and σ 1,1 (2) as the covariance between u 1(2) and u 1 . In other words, σ 1(2) is a function of the allele frequencies in line 1 and of the average effects for PB performance in line 2. For example, for a single locus, σ 1 (2) would be the square root of 2p 1 (1 − p 1 )α 2 2 , where p 1 is the allele frequency in line 1, and α 2 is the average effect in line 2. Similarly, σ 1,1(2) is a function of the allele frequencies in line 1 and of the average effects for PB performance in line 1 and line 2. Although obtaining these two variance components is straightforward when QTL effects and genotypes in the parental lines are known [9], this is not the case when estimates need to be obtained from phenotypes and marker genotypes in the parental lines.
In this study, we propose and compare three methods for approximating the required variance components to predict the bounds of r pc . The first two methods are based on the variances of and covariances between genomic estimated breeding values (GEBV), assuming that GEBV are useful proxies for true additive genetic values. The third method is based on REML estimates of the variances within and covariances between the parental lines, assuming that the variance of the genotypes (i.e. 2p(1 − p) ) across lines is similar. We assess the performance of these three methods by comparing predicted bounds with the estimated value of r pc for five traits using data on a three-way cross in pigs.

Dataset
We used phenotypic and genotypic data from three PB lines and their three-way terminal CB, provided by Topigs Norsvin and Norsvin. Sires from a Landrace line (LR) were mated to dams from a Large White line (LW) to produce F1 sows, which were crossed with sires from a synthetic boar line (S) to produce the three-way CB (i.e. Sx(LRxLW)). In total, we used 17,100 animals from line S, 6611 animals from line LR, 8587 animals from line LW, and 4173 three-way CB. The synthetic boar line S is a combination of the LW and Piétrain breeds and was created around 1975 [10]. All purebreds were housed in a nucleus environment, and the crossbreds in a commercial environment, possibly giving rise to GxE interaction between PB and CB performances. Phenotypes were pre-corrected for fixed effects and common litter effects using a larger dataset during the routine genetic evaluation of Topigs Norsvin. The traits included average daily gain during the test period between 30 and 120 kg (TGR), lifetime daily gain (LGR, from birth until 120 kg), daily feed intake during the test period (DFI), backfat (BFE) and loin depth (LDE). In the PB animals, BFE and LDE were measured at the end of the test period using an ultrasound device. In the CB animals, BFE and LDE were measured on the carcass after slaughter at the slaughterhouse. About 80% of the animals were genotyped with the Illumina 50K single nucleotide polymorphism (SNP) chip and the remaining 20% were genotyped with a custom Illumina 25K SNP chip. All genotypes were imputed to 50K within each purebred line using the FImpute v2.2 software [11]. First, the genotypes were imputed within the purebred lines. Second, the imputed genotypes from all purebred lines were used as a reference population to impute the genotypes of crossbred animals. For quality control, we excluded SNPs with a GenCall rate lower than 0.15 (Illumina Inc., 2005) or a call rate lower than 0.95, SNPs that showed a strong deviation from Hardy-Weinberg equilibrium (χ 2 > 600), SNPs that were located on sex chromosomes, and unmapped SNPs. The positions of the SNPs were based on the Sscrofa11.1 assembly. Finally, SNPs with a minor allele frequency (MAF) lower than 0.01 in any of the lines were excluded, yielding 35,595 SNPs that were used in all the analyses. All genotyped animals had a frequency of missing genotypes lower than 0.05 and were therefore retained for further analyses.

Derived expressions for the bounds of r pc
In a previous study, Duenk et al. [9] derived expressions that predict r pc from the genetic variance components of the purebred parental lines. They focused on predicting the decrease of r pc due to non-additive effects in a two-way cross, considering two scenarios (genetic models). The first genetic model assumed that there were only additive and dominance genetic effects, and the second genetic model assumed that there were only additive and additive-by-additive (AxA) epistatic genetic effects. The first step was to express the average effects for CB performance ( α CB ) in the parental line of interest (i.e. the focal line) in terms of average effects for PB performance in the parental lines, because the difference between these average effects determines the value of r pc [2,9]. The results showed that for the dominance model, α CB in the focal line is equal to the average effect for PB performance in the mated line. As a result, the dominance model may represent a lower bound for r pc , because the maximum difference between the average effects for PB and CB performances is then bounded by the difference in allele frequencies between the parental lines. For the AxA epistatic model, α CB in the focal line is equal to the mean of the average effects for PB performance in both parental lines. As a result, the epistatic model may represent an upper bound of r pc , because the minimum difference between the average effects for PB and CB performances is bounded by the difference in allele frequencies between the parental line and the cross. Duenk et al. [9] showed that the derived expressions indeed indicate approximate lower and upper bounds of r pc , regardless of the actual genetic model.

Predicting bounds of r pc
The present study builds on the work of Duenk et al. [9] and aims at validating the derived equations for the bounds of r pc in real data by evaluating whether the estimated r pc falls between the predicted bounds. We predicted the bounds of r pc for each parental line based on information from all parental lines, following the expressions for the lower and upper bounds of r pc in a three-way cross derived by Duenk et al. [9]. These equations were derived by expressing the additive genetic values of the different groups of animals as products of genotype matrices with vectors of average effects in the parental lines, to allow re-writing the variance components that appear in Eq. (1) in terms of variance components in the PB lines. Full derivations are presented in the appendix of Duenk et al. [9]. The predicted lower bound of r pc in sire line S for the three-way CB is: , expressed in line S and LR, σ LW (LW ),LW (LR) is the covariance between the additive genetic values of individuals in line LW for the trait expressed in line LW and LR, and the other parameters are the same as those in Eq. (5). The expression for the lower and upper bounds of r pc in dam line LR are obtained by replacing subscript LW with LR, and vice versa, in Eqs. (5) and (6).

Approximating variance components
Equations (3) to (6) show that we need the genetic variance components of the parental lines, which are not usually available, such as the additive genetic standard deviation in individuals of line S, for the trait expressed in line LW ( σ S(LW ) ). These across-line variance component cannot be estimated directly, because there are no phenotypes of the individuals from line S for the trait expressed in line LW. Therefore, here we compare three methods for approximating such variance components. The first method (VC GEBV ) relies on the assumption that the GEBV are estimates of true additive genetic values. Thus, the GEBV may be used as proxies for the additive genetic values in the focal line, and the required variance components can be approximated by computing the covariances between these GEBV and their variances. A disadvantage of VC GEBV is that the variance of GEBV is expected to be smaller than the variance of the additive genetic values ( σ 2 a ) due to shrinkage (i.e. the variance of GEBV is equal to r 2 σ 2 a , where r 2 is the reliability of GEBV), resulting in biased estimates of variance components. Thus, we included a second method (VC GEBV-S ), where the GEBV were corrected for shrinkage by dividing them by the square root of their reliabilities.
The third method for approximating variance components was based on REML estimates (VC REML ). It uses estimates of genetic variances resulting from a univariate analysis of the trait line, and estimates of genetic covariances resulting from pairwise bivariate analyses between parental lines. On the one hand, this method ignores that the required across-line variance components should correspond to individuals in the focal line, and instead uses the components estimated in the trait line. The across-line variance in the focal line can differ from the variance in the trait line due to differences in genotype frequencies between lines. On the other hand, REML estimates are generally accurate, and commonly reported. Furthermore, this method may result in accurate approximations when the genotype frequencies in the focal line are similar to those in the trait line.
Note that, for each of these methods, all variance components in Eqs. (3) to (6) were approximated using the method described, including the within-line variance components such as σ S .

Methods based on GEBV (VC GEBV and VC GEBV-S )
With VC GEBV, the required variance components are approximated by using the GEBV as proxies for true additive genetic values. We first analysed phenotypes in each PB line separately using a univariate model, while including genotypes of all PB lines to obtain GEBV for all PB animals. For example, for the phenotypes in line S, the model was: where y S is the vector of corrected phenotypes, µ S is the mean, u S is the vector of additive genetic values with incidence matrix Z , and e is the vector of independent random residuals. The additive genetic values were distributed as u S ∼ N 0, σ 2 s G , where σ 2 s is the additive genetic variance in line S, and G is the multi-breed genomic relationship matrix that includes all PB animals and is constructed following Wientjes et al. [12]. Note that this approach is equivalent to estimating marker effects in line S, and then multiplying these estimates with marker genotypes of all PB animals [13,14]. The model results in GEBV for all individuals from lines S, LW, and LR, for the trait expressed in line S ( u All(S) ). The same approach was used to estimate GEBV of all PB animals for the trait expressed in lines LW ( u All(LW ) ) and LR ( u All(LR) ), using the phenotypes of those lines.
The variances that appear in Eqs. Variances and covariances of GEBV can be poor approximations of genetic variance components, because GEBV are subject to shrinkage. For example, the variance of GEBV is a factor r 2 of the corresponding additive genetic variance, where r 2 is the reliability of the GEBV [15]. Therefore, with the second method (VC GEBV-S ), we used the same approach as VC GEBV , but we divided the GEBV by the square-root of their individual reliabilities to correct for shrinkage. The individual reliabilities were computed by the MTG2 program [16], using the prediction error variance [17]. This procedure aims at scaling the GEBV such that their variance is independent of their reliabilities. The variances of and the covariances between the scaled GEBV were used to approximate the parameters in Eqs. (3) to (6). Note that this approach is similar to the approach of Calo et al. [18] and Blanchard et al. [19], except that here we perform scaling at the level of the GEBV while allowing for different individual reliabilities, whereas in [18] and Blanchard et al. [19] scaling is performed at the level of the variances and covariances using the sum of the reliabilities across individuals.

Method based on REML estimates (VC REML )
With the third method (VC REML ), we used genomic REML estimates of variance components within and between lines to approximate the required parameters. Approximated variances and standard deviations resulted from univariate genomic analyses in the trait lines. For example, σ S(LW ) and σ LR(LW ) were approximated as the REML estimate of the additive genetic standard deviation in line LW ( σ LW ). The model used for this analysis was: where y LW is the vector of corrected phenotypes in line LW, µ LW is the mean, u LW is the vector of additive genetic values with incidence matrix Z , and e LW is the vector of independent random residuals. The additive genetic values were distributed as u LW ∼ N 0, σ 2 LW G LW , where σ 2 LW is the additive genetic variance in line LW, G LW is the genomic relationship matrix and is constructed following method 2 of VanRaden [20]. The REML estimate σ LW is used as an approximation for the standard deviation of the additive genetic values for the trait expressed in line LW (i.e. σ LW , σ S(LW ) and σ LR(LW ) ). The same approach was used to approximate variance components for the trait expressed in lines S ( σ S , σ LW (S) and σ LR(S) ) and LR ( σ LR , σ S(LR) and σ LW (LR) ), using data of those lines.
Approximate covariances resulted from pairwise bivariate genomic analyses for each combination of two parental lines. For example, the covariance σ S(LW ),S(LR) was approximated as the REML estimate of the additive genetic covariance between lines LW and LR ( σ LW ,LR ), resulting from the bivariate analysis of performance in lines LW and LR. The model can be written as [21,22]: where y is the vector of corrected phenotypes, µ is the mean, 1 is a column vector of 1s, u is the vector of additive genetic values with incidence matrix Z , and e is the vector of independent random residuals. Subscripts denote whether the terms relate to LW or LR animals. The distribution of the additive genetic values was: where σ 2 LW is the additive genetic variance in line LW, σ 2 LR is the additive genetic variance in line LR, and σ LW ,LR is the additive genetic covariance between lines LW and LR. The genomic relationship matrix between all individuals ( G ) was calculated following Wientjes et al. [12]. The variance components were estimated using REML in the MTG2 software [16] . The same approach was used to estimate covariances for each combination of two parental lines.
In addition to REML estimates of genetic variances and covariances, the bivariate analyses resulted in estimates of genetic correlations between parental lines (computed as r g = σ 1,2 σ 1 σ 2 ). We report these genetic correlations in the Results section because they are related to (the predicted bounds of ) r pc . Both r g and r pc depend on the size of the non-additive genetic effects and the differences in allele frequencies between parental lines [9,23].

Estimating r pc and validation of predicted bounds
We estimated r pc for each of the parental lines using a bivariate model that treats PB and CB performances as different, but genetically correlated traits. Note that we did not combine data from all PB lines in a single analysis, but we estimated r pc for each parental line, separately, using three bivariate models. The statistical model was similar to the bivariate model that was used to estimate REML estimates of covariances between lines (Eq. (9)), except that the data used were from PB and CB animals, instead of from lines LW and LR. The variance components were estimated using REML in the MTG2 software [16], and r pc was estimated as r pc = σ PB,CB σ PB σ CB . We validated the predicted bounds by comparing them with the estimated r pc . Note that, in these comparisons, we consider the predicted bounds to be correct if the interval of the estimated r pc plus and minus one standard error of the estimate overlapped with the interval between the predicted bounds. Although the predicted bounds may capture the estimated r pc , the bounds may (9) not be very informative if they cover an extremely large range. For example, a predicted lower bound of − 1 and a predicted upper bound of 1 will always capture estimated r pc , but such a prediction is not useful at all. Thus, we computed the range of bounds as the difference between the predicted upper and lower bounds of r pc .

Results
First, we compare the predicted bounds of r pc with the estimated r pc . Second, we present the results on estimated genetic parameters such as heritabilities, r pc , and genetic correlations between lines.

Predicted bounds of r pc
Of the 15 trait-line combinations, the estimated r pc (plus or minus one standard error) fell between the predicted lower and upper bounds in 14 cases with VC GEBV , 6 cases with VC GEBV-S , and 14 cases with VC REML (Fig. 1). With methods based on GEBV, the estimated r pc was usually closer to the upper bound than to the lower bound, while for VC REML , this varied across traits and lines.

Estimated genetic parameters
The estimated heritabilities for PB performance across traits ranged from 0.23 (TGR for line S) to 0.54 (BFE for line LW) ( Table 2). Across traits, the average heritability was lowest in line S (0.36), followed by lines LR (0.39) and LW (0.44). Across lines, the average heritability was lowest for TGR (0.32), and highest for BFE (0.48).
The estimated r pc across traits and lines ranged from 0.52 (TGR in line LR) to 0.80 (LDE in line LW), with an average of 0.72 (Table 3). On average across traits, the estimated r pc was highest in line S (0.78), followed by lines LW (0.74) and LR (0.64). On average across lines, the estimated r pc was lowest for TGR (0.65) and highest for LDE (0.77). The standard errors across traits and lines ranged from 0.04 (DFI and BFE in line S) to 0.11 (LDE in line LR), with an average of 0.06. On average, estimated r pc in line S had the smallest standard errors (0.05), followed by lines LW (0.06) and LR (0.08).
The estimated genetic correlation between lines ( r g ) across traits and line combinations ranged from 0.39 (TGR for LR-LW) to 0.75 (BFE for S-LW) ( Table 4). On average across traits, the estimated r g was lowest between lines LR and LW (0.48), followed by S-LR (0.53) and S-LW (0.69). On average across lines, the estimated r g was lowest for DFI (0.49), and highest for BFE (0.70). The standard errors across traits and lines ranged from 0.03 (BFE for S-LW) to 0.08 (LDE, LGR and TGR for LR-LW), with an average of 0.06.

Discussion
The objective of this study was to empirically validate expressions that predict the bounds of r pc based on variance components of the parental lines, without requiring CB data. Our results showed that it is indeed possible to predict the bounds of r pc , and that REML estimates provide useful approximations for the required variance components.

Methods based on GEBV (VC GEBV and VC GEBV-S )
The first method to predict the bounds of r pc was based on the variances of and covariances between GEBV of individuals in the focal line (VC GEBV ). Compared to the other methods we proposed, the advantage of VC GEBV is that the resulting variance components correspond to the individuals in the focal line, and not to the mated lines. Our results showed that VC GEBV resulted in a high fraction of cases where the predicted bounds captured the estimated r pc , but that the range between the lower and upper bounds was relatively large.
A disadvantage of VC GEBV is that the variance of GEBV is expected to be smaller than the variance of additive genetic values due to shrinkage, resulting in biased estimates of variance components. Thus, we included a second method (VC GEBV-S ), where GEBV were corrected for shrinkage by dividing them by the square root of their reliabilities. Although VC GEBV-S generally resulted in a smaller range than VC GEBV , it also resulted in a high fraction of cases where the estimated r pc was higher than the predicted upper bound, especially in lines LR and LW.

Method based on REML (VC REML )
The third method for approximating variance components was based on REML estimates (VC REML ). Our results show that, across traits and lines, VC REML resulted in more accurate prediction of the bounds of r pc than VC GEBV and VC GEBV-S because it resulted in the highest fraction of cases (14 out of 15) where the predicted bounds captured the estimated r pc , and the smallest range between the lower and upper bounds. This result was somewhat surprising, because the across-line variance components from VC REML do not correspond to the individuals of the focal line, but to those of the trait line (i.e., the mated lines from which phenotypes were analysed). We therefore investigated whether the within-line variance of the trait line was comparable to the across-line variance of the focal line. The within-line variance in the trait line is proportional to heterozygosity ( 2p i (1 − p i ) ) in that line [5]. In contrast, the variance in the focal line for the trait expressed in the other line is proportional to the heterozygosity in the focal line. In this study, the difference in heterozygosity between parental lines was small, although estimated additive genetic variances differed considerably (see Additional file 1: Table S1). This similarity suggests that the across-line variances in the focal lines are similar to the corresponding withinline variances in the trait lines, and that therefore VC REML results in relatively accurate approximations of across-line variance components. Thus, we conclude that, when the heterozygosities in the parental lines are comparable, the bounds of r pc can be relatively well predicted from REML estimates of the genetic variances within and the covariances between parental lines.
For other combinations of PB parental lines, the difference in heterozygosity between lines may be larger than in this study. Larger differences can occur when parental lines differ strongly in their history, for example due to differences in selection or in effective population size, or due to population admixture or migration. In those cases, the VC REML method may be improved by scaling the REML estimates to the focal line. For example, σ 2 S(LW ) may be approximated by multiplying the REML estimate of within-line variance in Similarly, σ S,S(LW ) may be approximated by multiplying the REML estimate However, this scaling procedure may not necessarily affect the predicted bounds, because scaling occurs in both the  numerators and denominators of the equations to predict r pc . To test this effect of scaling, we considered an extreme hypothetical example where the heterozygosity ( 2p i (1 − p i ) ) in three parental lines (A, B, and C) differed greatly: the heterozygosity in line B was twice as large as in line A, and in line C it was four times as large as in line A. In this example, we used the REML estimates of the DFI trait presented in this study to predict the bounds of r pc in all parental lines, and compared the predicted bounds when the across-line variance components were either scaled to the focal line, or not. The results showed that scaling resulted in a difference between predicted bounds of at most 0.06 (results not shown). In fact, the predicted lower bound in the dam lines did not change at all, because the scaling is completely cancelled out in the prediction equation. These results suggest that using VC REML may yield useful predictions of the bounds of r pc , even when the parental lines differ strongly in heterozygosity.

Estimation of genetic parameters
The estimated r pc ranged from 0.52 to 0.80 across lines and traits, which matches with the range reported in Wientjes and Calus [3] and with estimates from recent studies [24][25][26]. For most traits, r pc estimates were higher in line S than in lines LW and LR. This result was expected, because 50% of the CB genome originates from sire line S, whereas only 25% of the CB genome originates from dam lines LW and LR. For all traits, the r pc was lowest in line LR, probably because line LR is distantly related to the other lines S and LW, whereas lines S and LW themselves are more closely related. We estimated r pc using a bivariate model that considered PB and CB performances as different, but correlated traits. In this model, genomic relationships between PB and CB animals were computed using all the alleles in the CB animals, ignoring that part of the alleles originated from other PB lines and did not contribute to the relationship with the PB line of interest. Ideally, genomic relationships between PB and CB animals should be computed using only the alleles in CB animals that originated from the PB line of interest, by tracing the breedof-origin of alleles (BOA) [27]. However, in an empirical study of broiler chicken, estimates of r pc from models that ignored or considered the BOA were very similar [6]. The authors argued that, when ignoring the BOA, the alleles from the mated lines do not contribute to variation in relationships between the CB and the parental lines of interest [28,29]. As a result, the estimation of variance components was dominated by the alleles in CB that originated from the parental line of interest. This result may suggest that considering the BOA in the current study would not have affected the estimates of r pc .
Across PB lines and traits, the heritability estimates ranged from 0.23 to 0.54, and were generally in line with estimates from recent studies [24][25][26][30][31][32]. It should be noted that the estimated heritabilities reported in this study did not account for the fact that phenotypes were pre-corrected for random common litter effects. The heritability estimates reported here may therefore be somewhat inflated. For all traits, the heritability estimate for CB performance was lower than that for PB performance, which is in line with results from Zumbach et al. [33]. For most traits (DFI, LDE, LGR, and TGR), this difference was due to smaller estimated additive genetic variances for CB than for PB performance (see Additional file 1: Table S1), while for the other traits (BFE and LDE), this difference was due to greater estimated residual variances for CB than for PB performance (see Additional file 1: Table S2).
The estimated genetic correlations between lines ( r g ) ranged from 0.39 to 0.75 across traits. These estimates indicate that the genes and their effects underlying the studied traits are at least partially the same across parental lines. To our knowledge, genetic correlations between pig breeds have not been previously reported for production traits. The highest genetic correlations were found between lines S and LW, which was expected, because the synthetic boar lines S historically originated from individuals of line LW.

Practical considerations
Although the prediction of r pc with the approach proposed in this study does not require data on CB animals, it does require data on all the PB parental lines that are used to make the commercial cross. In practice, however, phenotypes of certain traits may not be collected routinely in all the parental lines. For example, phenotypes for reproduction traits may not be collected in the sire line. Therefore, for some traits, the methods proposed in this study may require that additional data are collected on the PB parental lines.
The present study shows that the bounds of r pc can be predicted based on parental line information, but that the exact value of r pc may remain unknown. However, these bounds may indicate whether it is worthwhile to collect data on CB animals to estimate r pc and to predict the breeding values for CB performance in the parental lines. For example, if the predicted upper bound of r pc is lower than 0.85, it is very likely that the true r pc is close to or below that value, in which case collection of CB data is expected to be beneficial [34]. In the case of Xiang et al. [35], the predicted upper bound of r pc for total number born based on reported variance components would be 0.84, which suggests that the collection of CB data would be beneficial. The estimated r pc in that study (0.67 ± 0.10) was indeed lower than this predicted upper bound.

Conclusions
In conclusion, using REML estimates of variance components within and between parental lines to predict the bounds of r pc resulted in better predictions than methods based on GEBV. If confirmed with other datasets, this approach may help breeders to predict the benefit of collecting CB data based only on parental line information. Thus, we recommend that studies that estimate r pc with genotype data also report estimated genetic variance components within and between the parental lines, by estimating them as described in this paper.
Additional file 1: Table S1. Estimated additive genetic variances. Description: Sum of variance of marker genotypes ( σ 2 x ) and estimated additive genetic variances ( σ 2 a ) with standard errors (se), for each trait (presented in columns), for all parental lines (S, LR, and LW) and their crossbred (CB) (presented in rows). Table S2. Estimated residual variances. Description: Estimated residual variances ( σ 2 e ) with standard errors (se), for each trait (presented in columns), for all parental lines (S, LR, and LW) and their crossbred (CB) (presented in rows).